Context

Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances. Customers leaving the credit card service means lost revenue for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand their reasons for leaving so that it can improve in those areas. As a data scientist at Thera Bank, the task is to build a classification model that predicts whether a customer will churn. This will help the bank improve its services so that customers do not give up their credit cards. The goal is to identify the model that gives the best performance.

Objective

Dataset Description

Data Dictionary:

Key Questions:

Importing libraries

Loading Data

Data Overview

Let's check the number of unique values in each column

Checking the value count for each category of categorical variables

Univariate analysis

Observations on Age

Observations on Months on book

Observations on Total relationship count

Observations on Months inactive in last 12 months

Observations on Contacts between customer and bank in past 12 months

Observations on Credit limit

Observations on Total Revolving Balance

Observations on Average open to buy

Observations on Total Trans Amt

Observations on Total_Trans_Ct

Observations on Total_Ct_Chng_Q4_Q1

Observations on Total_Amt_Chng_Q4_Q1

Observations on Avg_Utilization_Ratio

Observations on Attrition Flag

Observations on Gender

Observations on Dependent count

Observations on Education_Level

Observations on Marital_Status

Observations on Income_Category

Observations on Card_Category

Bivariate Analysis

Impact of Customer Age on Attrition

Impact of Dependent count on Attrition

Impact of Months on book on Attrition

Impact of Total_Relationship_Count on Attrition

Impact of Months_Inactive_12_mon on Attrition

Impact of Contacts_Count_12_mon on Attrition

Impact of Credit_Limit on Attrition

Impact of Total_Revolving_Bal

Impact of Avg_Open_To_Buy

Impact of Total_Amt_Chng_Q4_Q1

Impact of Total_Trans_Amt

Impact of Total_Trans_Ct

Impact of Total_Ct_Chng_Q4_Q1

Impact of Avg_Utilization_Ratio

Detecting Outliers

Data Preparation for Modeling

Split data

Missing-Value Treatment

Model evaluation criterion

We will use recall as the metric for model performance, because the company could face two types of losses:

  1. Identifying a churner as a non-churner: loss of money and business because the bank did not focus on that customer
  2. Identifying a non-churner as a churner: loss of time and effort spent focusing on a customer who did not need it

Which loss is greater?

How do we reduce this loss, i.e. how do we reduce false negatives?
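Recall is the metric that directly tracks false negatives, so it may help to write it out. In the block below, $TP$ is the number of churners the model flags correctly and $FN$ is the number of churners it misses; the figures in the worked line are purely illustrative and are not results from this dataset.

```latex
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
% Illustrative numbers only: if 100 customers actually churn and the
% model flags 80 of them, then TP = 80 and FN = 20, so
\[
\text{Recall} = \frac{80}{80 + 20} = 0.80
\]
```

Every false negative that becomes a true positive raises recall, which is why maximizing recall is equivalent to reducing missed churners.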

Hyperparameter Tuning

We will tune the XGBoost, GBM, and AdaBoost models using RandomizedSearchCV.

First, let's create two functions to calculate the different metrics and the confusion matrix, so that we don't have to repeat the same code for each model.
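The notebook's own helper functions are not shown here, so the following is only a minimal sketch of what two such helpers might look like. The names `confusion_counts` and `classification_metrics` are hypothetical, the labels assume churners are encoded as `1`, and the implementation uses plain Python rather than `sklearn.metrics` so that it stays self-contained.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for a binary problem.

    Hypothetical helper: `positive` is the label treated as "churner".
    """
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # churner correctly flagged
        elif actual == positive:
            fn += 1  # churner missed by the model
        elif predicted == positive:
            fp += 1  # non-churner flagged as churner
        else:
            tn += 1  # non-churner correctly left alone
    return tn, fp, fn, tp


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, recall, precision and F1 as a dictionary."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred, positive)
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Toy labels for illustration only (not taken from the bank dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

Calling the same two functions on the training and validation predictions of each tuned model keeps the comparison consistent across models.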

XGBoost

RandomizedSearchCV

AdaBoostClassifier

GradientBoost classifier
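As a rough sketch of how a randomized search scored on recall might be set up for a gradient boosting model, the example below uses scikit-learn's `GradientBoostingClassifier` with `RandomizedSearchCV`. The synthetic data from `make_classification`, the class imbalance, the parameter ranges, and `n_iter=5` are all assumptions made for illustration; they are not the notebook's actual data or search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the churn data (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=1
)

# Hold out a validation set before any tuning happens.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Illustrative search space; the values are not from the notebook.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    scoring="recall",  # optimize the metric chosen above
    cv=3,
    random_state=1,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the untouched validation set.
val_recall = recall_score(y_val, search.predict(X_val))
print("Best parameters:", search.best_params_)
print("Cross-validated recall:", search.best_score_)
print("Validation recall:", val_recall)
```

Setting `scoring="recall"` makes the search rank candidates by the same criterion used to judge the final models, and fitting only on the training split keeps the validation set out of the tuning loop.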

SMOTE to upsample smaller class

Applying Upsampling on tuned adaboost

Training Set

Validation Set

Applying Upsampling on tuned gradient boost

Applying Upsampling on tuned XGBoost

Downsampling with Cluster centroids

Applying Downsampling on tuned adaboost

Training Set

Applying Downsampling on tuned gradient boosting

Training Set

Validation Set

Applying Downsampling on tuned XGBoost

Training Set

How did we prevent Data Leaks?

Applying the tuned Gradient Boost Downsampling on test data

Pipelines for productionizing the model

Column Transformer

Conclusion and Insights